# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Thu Nov 19 21:48:27 2020"
Still a bit confused, as I got back from fieldwork in Central Finland on Thursday and took the rest of the week off. But I'm excited to start the course! Link to my repo: https://github.com/vilnat/IODS-project
date()
## [1] "Thu Nov 19 21:48:27 2020"
d14 <- read.table("data/learning2014.csv")
str(d14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(d14)
## [1] 166 7
This dataset comes from a survey study conducted in an introductory course on social statistics at the University of Helsinki. The study focused on the relationship between learning approaches and students' achievement. The dataset contains background information on each student, such as gender, age and how well they did in the course exam. It also contains aggregated survey results in the columns "deep", "stra" and "surf", which correspond to the deep, strategic and surface learning approaches.
This plot shows a summary of the variables in the dataset. The line graphs on the diagonal show the distributions of the variables. Apart from the background variables, most of the data is fairly normally distributed; the aggregated deep learning variable is somewhat skewed towards high values. The scatter plots and correlation coefficients surrounding the diagonal show the relationships within the data. The strongest correlations are between attitude towards statistics and exam performance (positive) and between deep and surface learning (negative).
library(ggplot2)
library(GGally)
p <- ggpairs(d14, mapping = aes(alpha = 0.5), lower = list(combo = wrap("facethist", bins = 20, colour = "cornflowerblue")), diag = list(continuous = wrap("densityDiag", colour = "cornflowerblue")))
p
Finally, here is a summary of the variables:
summary(d14)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Here I built a linear model that explains exam success using the variables attitude, strategic learning and age. I chose these variables using the backward elimination method: you start with all explanatory variables in the model and remove the least significant one at a time until only statistically significant variables remain.
The summary below shows the statistics of the model. The residuals are relatively normally distributed. Of the explanatory variables, attitude is clearly the most statistically significant, and it has a strong positive influence on exam results. While one could spend a while debating whether p-values around 0.06–0.10 are 'good enough', I chose to keep strategic learning and age in the model as they improve the R-squared value. A higher strategic learning score leads to higher predicted exam results, whereas age has a slightly negative influence. Overall the model explains approximately 22 % of the variation in exam results (multiple R-squared = 0.2182), so a large part of the variation remains unexplained.
m <- lm(points ~ attitude + stra + age, data = d14)
summary(m)
##
## Call:
## lm(formula = points ~ attitude + stra + age, data = d14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## attitude 3.48077 0.56220 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## age -0.08822 0.05302 -1.664 0.0981 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
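The backward elimination I did by hand could also be sketched with R's built-in `step()` function. Note this is only an approximation of my process: `step()` drops terms by AIC rather than by p-values, so it may keep a slightly different set of variables.

```r
# Start from the full model with every explanatory variable in d14
full <- lm(points ~ ., data = d14)
# step() removes terms one at a time while doing so improves the AIC
reduced <- step(full, direction = "backward", trace = 0)
summary(reduced)
```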
Linear models make certain assumptions about the data and about the relationship between the model and the dependent variable. Model residuals show the difference between the actual observations and the model's predictions. Linear models assume that these residuals are normally distributed and that their variance doesn't depend on the fitted values. The top-left figure below (residuals vs. fitted) shows the spread of the residuals across the fitted values. There doesn't seem to be any clear pattern, although a few residuals are somewhat separated from the main cluster. I don't really know what is happening there, but even these residuals don't appear to depend on the fitted values.
The top-right figure (the normal Q-Q plot) shows how close to normal the residual distribution is. For the most part the residuals follow the reference line, meaning they are roughly normally distributed, apart from the largest negative residuals shown in the bottom left of that figure.
The final figure (residuals vs. leverage) shows the influence of each individual observation on the model. None of the observations has an unreasonably high influence, meaning there are no clear outliers whose removal would obviously improve the model.
par(mfrow = c(2,2))
plot(m, which = c(1,2,5))
date()
## [1] "Thu Nov 19 21:48:35 2020"
This dataset contains information about students in two Portuguese schools. The data includes various attributes of the students, such as their family background and social activities, as well as grade data from two subjects, mathematics and Portuguese. The original data is available here: https://archive.ics.uci.edu/ml/datasets/Student+Performance
data <- read.table("data/alcohol_is_bad.csv", sep = ";", header = TRUE)
colnames(data)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The data also contains information about the students' alcohol consumption which gives us the opportunity to study what characteristics might correlate with it. For this, I chose four interesting variables that might or might not explain alcohol consumption.
"higher" = binary variables, whether or not the student wants to pursuit higher education. This might correlate negatively with alcohol usage, i.e. if you want to get higher education, you probably drink less (at least for now)
"activities" = binary variable, whether or not the student takes extra-curricular activities, again probably correlating negatively with alcohol usage, at least if we're thinking of overly stereotypical teenagers
"famrel" = a variable ranging from 1 to 5, describing how good the student's relationship is with their family. Might correlate negatively with alcohol usage (i.e. bad family relationship leads to drinking)
"goout" = again, a variable ranging from 1 to 5, describing how much the student goes out with friends, possibly correlating positively with alcohol, because what else would you do with friends?
Now, let's see how they actually compare to alcohol usage.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
g1 <- ggplot(data = data, aes(x=higher, y = alc_use))
g1 + geom_boxplot()
data %>% group_by(higher, high_use) %>% summarise(count = n())
## # A tibble: 4 x 3
## # Groups: higher [2]
## higher high_use count
## <fct> <lgl> <int>
## 1 no FALSE 9
## 2 no TRUE 9
## 3 yes FALSE 259
## 4 yes TRUE 105
From the boxplot it would seem that lower alcohol consumption correlates with wanting higher education. However, as only 18 students out of almost 300 don't want higher education, no firm conclusions can be drawn, and my hypothesis remains unproven.
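The raw counts above can also be turned into proportions, which makes the group comparison more direct. A small sketch, assuming `data` and `dplyr` are loaded as above:

```r
# Share of high alcohol users within each 'higher' group
data %>%
  group_by(higher) %>%
  summarise(n = n(), high_use_share = mean(high_use))
```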
g1 <- ggplot(data = data, aes(x=activities, y = alc_use))
g1 + geom_boxplot()
data %>% group_by(activities, high_use) %>% summarise(count = n())
## # A tibble: 4 x 3
## # Groups: activities [2]
## activities high_use count
## <fct> <lgl> <int>
## 1 no FALSE 122
## 2 no TRUE 59
## 3 yes FALSE 146
## 4 yes TRUE 55
Seems like there is no visible relationship between alcohol usage and extra-curricular activities. This is not that surprising and my original hypothesis was rather weak.
data$famrel <- as.character(data$famrel) #Convert to character so ggplot treats famrel as discrete; geom_boxplot needs a discrete x-axis
g1 <- ggplot(data = data, aes(x=famrel, y = alc_use))
g1 + geom_boxplot()
data %>% group_by(famrel, high_use) %>% summarise(count = n())
## # A tibble: 10 x 3
## # Groups: famrel [5]
## famrel high_use count
## <chr> <lgl> <int>
## 1 1 FALSE 6
## 2 1 TRUE 2
## 3 2 FALSE 10
## 4 2 TRUE 9
## 5 3 FALSE 39
## 6 3 TRUE 25
## 7 4 FALSE 135
## 8 4 TRUE 54
## 9 5 FALSE 78
## 10 5 TRUE 24
There seems to be a weak relationship between particularly good family relationships and lower alcohol usage. The less clear pattern at the lower end might be caused by the smaller number of students there. This is in line with my original assumption, yay me.
data$goout <- as.character(data$goout) #Same conversion as with famrel: geom_boxplot needs a discrete x-axis
g1 <- ggplot(data = data, aes(x=goout, y = alc_use))
g1 + geom_boxplot()
At last, there is a clear positive correlation between alcohol usage and going out with friends. I managed to guess something right.
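To put a number on the trend visible in the boxplot, one could compute the mean consumption per `goout` level. A sketch under the same assumptions (`data` and `dplyr` loaded as above):

```r
# Mean alcohol use at each level of going out with friends
data %>%
  group_by(goout) %>%
  summarise(mean_alc = mean(alc_use), n = n())
```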
And now it's midnight and I'm falling asleep. So thanks for grading this far, feel free to move on to the next person.
The dataset we're handling this week contains data about Boston's suburbs, their housing values and various related variables. Some examples of the variables included:

* population-related variables, such as student-to-teacher ratio, crime rate and lower status of the population
* housing-related variables, such as number of rooms, value of owner-occupied homes and proportion of owner-occupied units built prior to 1940
* infrastructure variables, such as proportion of non-retail business acres per town and distance to Boston's employment centres
* an environmental variable: nitrogen oxides concentration
The dataset has 14 numerical variables of 506 suburbs of Boston, as shown below.
#Let's read the data
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
#And examine it a bit
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Let's examine our dataset a bit more closely. From the histograms below, we can see that the variables are distributed rather differently. Some, such as crime rate (crim) and the proportion of black residents (black), are heavily skewed towards one end of the spectrum; accessibility to radial highways (rad) and property tax rate (tax) are bimodal, with values concentrated at both the high and low ends; and the rest are somewhat normally or evenly distributed.
library(ggplot2)
library(GGally)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
hist.data.frame(Boston, nclass =10)
And below we see how strongly these variables correlate with each other. The strongest negative correlations are between the proportion of lower-status population (lstat) and the value of owner-occupied homes (medv), between age and distance to employment centres (dis), and between dis and nitrogen oxide concentration (nox). The strongest positive correlation is between accessibility to radial highways (rad) and property tax rate (tax).
library(tidyr)
library(corrplot)
## corrplot 0.84 loaded
cor_matrix <- round(cor(Boston), 2)
corrplot(cor_matrix, method="circle", type="upper", tl.pos="d", tl.cex = 0.6)
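Instead of reading the strongest pairs off the plot, they can also be extracted programmatically. A sketch, assuming `Boston` is loaded from MASS as above:

```r
# Flatten the correlation matrix into pairs and list the strongest ones
cors <- as.data.frame(as.table(cor(Boston)))
names(cors) <- c("var1", "var2", "cor")
# Keep each pair once and drop the diagonal
cors <- subset(cors, as.character(var1) < as.character(var2))
head(cors[order(-abs(cors$cor)), ], 5)
```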
Next we'll standardize the data by subtracting each column's mean from its values and dividing the result by the column's standard deviation. As can be seen below, this changes each variable's mean to zero.
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
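The `scale()` call is equivalent to doing the centering and scaling by hand, which makes the formula above concrete. A quick sketch checking this, assuming `Boston` is loaded as above:

```r
# Manual standardization: subtract each column's mean, divide by its sd
manual <- sweep(Boston, 2, colMeans(Boston))            # center
manual <- sweep(manual, 2, apply(Boston, 2, sd), "/")   # scale
# Largest difference to scale() output; effectively zero (floating-point noise)
max(abs(as.matrix(manual) - scale(Boston)))
```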
We'll also want the crime rate variable to be categorical which can be achieved easily by dividing it into classes based on its quantiles.
bins <- quantile(boston_scaled$crim)
labels <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = labels)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
table(boston_scaled$crime)
##
## low med_low med_high high
## 127 126 126 127
Finally, we want to be able to use the data for testing predictions so we'll divide the data into training and testing sets.
n <- nrow(boston_scaled)
#We'll divide the data into 80 % training and 20 % testing sets.
#(Note: without set.seed() the random split, and thus the exact results below, change on every knit.)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
Next we'll use linear discriminant analysis (LDA), a classification method, to investigate crime rate in more detail.
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
The figure above shows a point cloud of the observations in the training data on a plane consisting of the first two LDA-axes. The points are coloured by crime rate.
Now we'll test how well our method can predict crime rates based on the testing data. We'll first remove the actual crime rate data from the test data, then predict them and finally compare the model results to the actual data.
#Move the correct crime rates away from the test data
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
#Predict
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 7 1 0
## med_low 5 20 4 0
## med_high 0 4 17 3
## high 0 0 0 26
The model predicts high crime rates remarkably well: all 26 high-crime suburbs in the test set are classified correctly. It also works reasonably well for the low, medium-low and medium-high classes (15/23, 20/29 and 17/24 correct, respectively), and most misclassifications fall into a neighbouring class.
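The per-class numbers can be computed directly from the cross tabulation rather than counted by hand. A sketch, assuming `correct_classes` and `lda.pred` from the chunk above (the split is random, so the exact values vary between knits):

```r
tab <- table(correct = correct_classes, predicted = lda.pred$class)
# Overall accuracy: share of test suburbs assigned to the correct class
sum(diag(tab)) / sum(tab)
# Per-class accuracy: share of each true class predicted correctly
diag(tab) / rowSums(tab)
```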
Finally, we'll do a clustering analysis on the whole dataset using k-means. First we'll look at the distances between observations, then run the analysis with different numbers of clusters to find the optimal number.
data(Boston)
new_boston <- scale(Boston)
new_boston <- as.data.frame(new_boston)
dist_eu <- dist(new_boston)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(new_boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
Based on the plot above, two clusters looks like a good number, as the total within-cluster sum of squares starts to decrease much more gradually after that point.
Now we'll run the k-means algorithm again with two clusters and see what the result looks like.
km <- kmeans(new_boston, centers = 2)
#Let's split the variables into two smaller sets so we can actually see something.
km$cluster <- as.character(km$cluster) #Convert to character so ggpairs treats the clusters as discrete classes
ggpairs(new_boston, columns = 1:7, mapping = aes(alpha = 0.5, col = km$cluster))
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
ggpairs(new_boston, columns = 8:14, mapping = aes(alpha = 0.5, col = km$cluster))
That's a lot of little dots. In general, the two clusters can be distinguished in quite a few of the scatter plots and distribution graphs. In the first plot, the proportion of non-retail business acres (indus) and the concentration of nitrogen oxides (nox) in particular are clearly divided between the two clusters; in the second plot, property tax rate (tax) is also clearly divided. These variables seem to play an important role in the clustering. On the other hand, the number of rooms per dwelling (rm), for example, doesn't seem to play any role in this cluster analysis.